- "WAIS Corporate Paper version 3"
- MS-Word version available for anonymous ftp from think.com in
- /pub/wais/wais-overview-docs.sit.hqx. This file is
- wais-corporate-paper.text
-
-
- An Information System for Corporate Users: Wide Area Information Servers
-
-
- Brewster Kahle, Thinking Machines Corporation
- Art Medlar, Scolex Information Systems
- 8 April 1991
-
-
- To explore text-based information systems for corporate executives, four
- companies have jointly developed a prototype which gives flexible access to
- full-text documents. The four participating companies are Dow Jones & Co.,
- with its premier business information sources; Thinking Machines
- Corporation, with its high-end information retrieval engines; Apple
- Computer, with its user interface expertise; and KPMG Peat Marwick, with
- its information-hungry user base.
-
- One of the primary objectives of the project is to allow a user to retrieve
- personal, corporate, and wide area information through one easy-to-use
- interface. For example, instead of using Lotus Magellan(tm) for personal
- information, Verity Topic(tm) for corporate data, and Mead Data Dialog(tm)
- for published text, one application can access all three categories of
- information. The user isn't required to become familiar with several
- entirely different systems. In addition, since the interface consolidates
- data from many different sources, they can be manipulated effortlessly,
- virtually without regard to their origins.
-
- The Wide Area Information Server (WAIS, pronounced "ways") project is an
- experimental venture seeking to determine whether current technologies can
- be used to make profitable end-user full-text information systems. Fifteen
- users have been actively using the system for over three months. They have
- integrated it into their workday routine in much the same way as they have
- previously integrated spreadsheets and word processors. This preliminary
- success has convinced us that a WAIS-like system can be a valuable tool for
- corporate information retrieval. This paper discusses the design and
- implementation of the prototype system.
-
-
- Introduction
-
- Electronic publishing is the distribution of textual
- information over electronic networks. It has been emerging as a viable
- alternative to traditional print publishing as the necessary underlying
- technologies develop. Among the more essential of these are:
-
- * High Resolution Display Screens
- * Reliable, High-Speed Data Communications
- * Desktop Publishing Systems
- * Inexpensive Data Storage Media
-
- While these technologies have been developed for uses other than electronic
- publishing, they are the necessary precursors for full-text retrieval
- systems.
-
- From the user's point of view, there are several problems to be overcome.
- First, there must be some way of finding and selecting databases from a
- potentially unlimited pool. Second, although these databases may be
- organized in different ways, the user should not need to become familiar
- with the internal configuration of each one. Finally, there must be some
- practical way of organizing responses on the user's machine in order to
- maintain control over what may become a vast accumulation of data. In
- addition, developers are faced with a number of architectural issues. The
- system must be scalable; that is, it must allow for the future growth of
- both the complexity and number of clients and servers. It must be secure;
- each server's data must be protected from corruption, and the privacy of
- the users must be ensured. Lastly, since an unreliable source is useless
- in a corporate environment, access must be thoroughly robust.
-
- System Overview
-
- The prototype WAIS system takes advantage of current state-of-the-art
- technology, and presents solutions to all of the above problems. The
- system is composed of three separate parts: Clients, Servers, and the
- Protocol which connects them.
-
- The Client is the user interface, the server does the indexing and
- retrieval of documents, and the protocol is used to transmit the queries
- and responses. The client and server are isolated from each other through
- the protocol. Any client which is capable of translating a user's request
- into the standard protocol can be used in the system. Likewise, any server
- capable of answering a request encoded in the protocol can be used. In
- order to promote the development of both clients and servers, the protocol
- specification is public, as is its initial implementation.
-
- On the client side, questions are formulated as English language questions.
- The client application then translates the query into the WAIS protocol,
- and transmits it over a network to a server. The server receives the
- transmission, translates the received packet into its own query language,
- and searches for documents satisfying the query. The list of relevant
- documents is then encoded in the protocol, and transmitted back to the
- client. The client decodes the response, and displays the results. The
- documents can then be retrieved from the server.
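-
- The round trip described above can be sketched in a few lines of Python.
- This is only an illustration under stated assumptions: JSON stands in for
- the actual WAIS protocol encoding, the corpus is a hypothetical in-memory
- dictionary, and the function names (encode_query, handle_request,
- decode_response) are invented for the example. Ranking is by the number of
- words shared with the question, with no handling of common words.
-

```python
import json
import re

# Hypothetical in-memory corpus standing in for a server's indexed documents.
DOCUMENTS = {
    "doc-1": "Quarterly earnings rose at the regional bank",
    "doc-2": "New accounting rules for bank mergers",
    "doc-3": "Weather outlook improves next week",
}


def words(text):
    """Lower-case a string and return the set of words it contains."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def encode_query(question):
    """Client side: wrap an English question in a wire message.

    JSON is an assumption made for this sketch, not the WAIS encoding.
    """
    return json.dumps({"type": "search", "question": question}).encode("utf-8")


def handle_request(packet):
    """Server side: decode the query, score documents, encode the ranked list."""
    request = json.loads(packet.decode("utf-8"))
    query_words = words(request["question"])
    hits = []
    for doc_id, text in DOCUMENTS.items():
        score = len(query_words & words(text))  # count of shared words
        if score > 0:
            hits.append({"id": doc_id, "score": score, "headline": text})
    # Highest score first; ties broken by document id for a stable order.
    hits.sort(key=lambda hit: (-hit["score"], hit["id"]))
    return json.dumps({"type": "results", "hits": hits}).encode("utf-8")


def decode_response(packet):
    """Client side: turn the server's reply back into a list of hits."""
    return json.loads(packet.decode("utf-8"))["hits"]


reply = handle_request(encode_query("What are the accounting rules for a bank?"))
for hit in decode_response(reply):
    print(hit["score"], hit["id"], hit["headline"])
```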
-
-
- Digital Researcher
-
- The traditional information research scenario is familiar to anyone who has
- ever visited a reference desk at a public or corporate library. The client
- approaches a librarian with a description of needed information. The
- librarian might ask a few background questions, and then draws from
- appropriate sources to provide an initial selection of articles, reports,
- and references. The client then sorts through this selection to find the
- most pertinent documents. With feedback from these trials, the researcher
- can refine the materials and even continue to supply the user with a flow
- of information as it becomes available. Monitoring which articles were
- useful can help keep the researcher on-track.
-
- The WAIS system is an attempt at automating this interaction: the user
- states a question in English, and a set of document descriptions comes back
- from selected sources. The user can examine any of the items, be they text,
- picture, video, sound, or whatever. If the initial response is incomplete
- or somehow insufficient, the user can refine the question by stating it
- differently.
-
- In addition, the user may also mark some of the retrieved documents as
- being "relevant" to the question at hand, and then re-run the search. The
- server recognizes the marked documents, and attempts to find others which
- are similar to them. In the present WAIS system, "similar" documents are
- simply ones which share a large number of common words; however, there is
- potentially no upper limit on the intelligence of a server in determining
- what similarity entails. This method of information retrieval is called
- "relevance feedback." The idea has been around for many years (1) and the
- first commercial system utilizing it, DowQuest (2), was voted Database of
- the Year by Online Magazine in January 1989.
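-
- As a rough sketch of this idea, assuming that "similar" means nothing
- more than sharing many words, a server might rank candidates against the
- marked documents as follows. The function name rank_by_feedback and the
- sample headlines are hypothetical; a production server would presumably
- weight words rather than count them equally.
-

```python
import re


def words(text):
    """Lower-case a string and return the set of words it contains."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_by_feedback(marked_relevant, candidates):
    """Rank candidate documents by how many words they share with the
    documents the user marked as relevant.

    marked_relevant: list of document texts chosen by the user.
    candidates: dict mapping document id to document text.
    Returns ids of candidates sharing at least one word, best first.
    """
    feedback_words = set()
    for text in marked_relevant:
        feedback_words |= words(text)

    scored = []
    for doc_id, text in candidates.items():
        shared = len(feedback_words & words(text))
        if shared > 0:
            scored.append((shared, doc_id))
    # Most shared words first; ties broken by id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


marked = ["Bank merger approved by regulators"]
pool = {
    "a": "Regulators review second bank merger",
    "b": "Local team wins championship",
    "c": "Bank opens new branch",
}
print(rank_by_feedback(marked, pool))
```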
-
-
- User Interfaces: Asking Questions
-
- Users interact with the WAIS system through the Question interface. The
- interface may appear different on various implementations: for example, a
- character display terminal will have a different look than one which is
- capable of displaying bit-mapped graphics. The key, however, is that the
- user need only become familiar with one interface which provides access to
- all available information sources.
-
- The WAIS system, in this first incarnation, was designed to be used by
- accountants and corporate executives who are relatively untrained in search
- techniques. Consequently, to aid those users who have neither the time nor
- desire to learn a special purpose query language, the system uses English
- language queries augmented with relevance feedback. While the system's
- servers currently do not extract semantic information from the English
- queries, they do their best to find and rank articles containing the
- requested words and phrases. Used in conjunction with relevance feedback,
- this method of searching has proven to be more than adequate for the types
- of searches and databases typically encountered.
-
- The illustrations here are taken from the initial WAIStation program
- produced at Thinking Machines for the Apple Macintosh. Several other
- interfaces are under development at Apple Computer, Dow Jones, and
- elsewhere. [omitted in text-only version]
-
- * Step 1: Sources are dragged with the mouse into the Question Window. A
- question can contain multiple sources. When the question is run, it asks
- for information from each included source.
-
- * Step 2: When a query is run, headlines of documents satisfying the query
- are displayed.
-
- * Step 3: With the mouse, the user clicks on any result document to
- retrieve it.
-
- * Step 4: To refine the search, any one or more of the result
- documents can be moved to the "Which are similar to:" box. When the
- search is run again, the results will be updated to include documents which
- are "similar" to the ones selected.
-
-
- Contacting Remote Sources of Information
-
- [figure omitted] Figure 1: The Source description contains all the
- necessary information for contacting an information server.
-
-
- From the user's point of view, a server is a source of information. It
- can be located anywhere that one's workstation has access to: on the
- local machine, on a network, or on the other side of a modem. The
- user's workstation keeps track of a variety of information about each
- server. The public information about a server includes how to contact it,
- a description of the contents, and the cost. In addition, individual users
- maintain certain private information about the servers they use. Users
- need to budget the money they are willing to spend on information from
- particular servers, they need to know how often and when each server is
- contacted, and they need to assess the relative usefulness of each server.
- This information helps guide the workstation in making cost effective
- decisions in contacting servers.
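-
- One way to picture the record a workstation keeps for each server is
- shown below. The field names, the dollar units, and the 0-to-1 usefulness
- rating are assumptions made for this sketch; the actual Source description
- format is not reproduced here.
-

```python
from dataclasses import dataclass


@dataclass
class Source:
    # Public information, published by the server.
    name: str
    address: str             # how to contact it (network host, phone number, ...)
    description: str         # what the database contains
    cost_per_query: float    # assumed to be in dollars
    # Private information, kept by the individual user.
    budget_remaining: float = 0.0   # what the user is still willing to spend
    times_contacted: int = 0
    usefulness: float = 0.5         # user's own rating, 0 (poor) to 1 (excellent)


def affordable_sources(sources):
    """Return the sources whose cost fits the remaining budget, most useful first."""
    usable = [s for s in sources if s.cost_per_query <= s.budget_remaining]
    return sorted(usable, key=lambda s: -s.usefulness)


def record_contact(source):
    """Update the private bookkeeping after one query to a source."""
    source.budget_remaining -= source.cost_per_query
    source.times_contacted += 1


news = Source("news-wire", "news.example", "business news", 0.25,
              budget_remaining=1.0, usefulness=0.8)
record_contact(news)
print(news.budget_remaining, news.times_contacted)
```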
-
- With most current retrieval systems, complications develop as soon as one
- begins dealing with more than one source of information. The most common
- problem is that of asking a particular question. For example, one contacts
- the first source, asks it for information on some topic, contacts the next
- source, asks it the same questions (most likely using a different query
- language, a different style of interface, a different system of billing),
- contacts the next source, and so on. One of the primary motivations behind
- the initial development of the WAIS system was to replace all this
- with a single interface.
-
- With WAIS, the user selects a set of sources to query for information, and
- then formulates a question. When the question is run, the system
- automatically asks all the servers for the required information with no
- further interaction necessary by the user. The documents returned are
- sorted and consolidated in a single place, to be easily manipulated by the
- user. The user has transparent access to a multitude of local and remote
- databases.
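-
- The fan-out and consolidation step might look like the following sketch.
- It assumes each source can be called as a function returning (score,
- headline) pairs, and that scores from different servers are comparable;
- neither assumption comes from the protocol itself, and the two stub
- sources are hypothetical.
-

```python
def ask_all(question, sources):
    """Send one question to every selected source and merge the answers.

    sources: dict mapping a source name to a function that takes the
    question and returns a list of (score, headline) pairs.
    Returns one list of (score, source_name, headline), best score first.
    """
    merged = []
    for source_name, search in sources.items():
        for score, headline in search(question):
            merged.append((score, source_name, headline))
    # Highest score first; ties broken by source name, then headline.
    merged.sort(key=lambda hit: (-hit[0], hit[1], hit[2]))
    return merged


# Hypothetical stand-ins for a local database and a remote news service.
def personal_files(question):
    return [(0.9, "Memo: client meeting notes")]


def news_wire(question):
    return [(0.4, "Markets close higher"), (0.7, "Bank merger announced")]


results = ask_all("bank merger", {"personal": personal_files, "news": news_wire})
for score, source_name, headline in results:
    print(score, source_name, headline)
```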
-
-
- Rerunning Questions -- A Personal Newspaper
-
- In addition to providing interactive access to a vast quantity of
- information, the WAIS system can also be used as a rudimentary personal
- newspaper. A virtually unlimited number of queries can be saved, and
- updated at periodic intervals. To do this, the user's workstation is
- directed to contact each server at certain set times. When a source of
- information is contacted, any questions referencing that source are updated
- with new documents. The users can then easily browse through the results
- the next morning.
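-
- The update cycle can be sketched as below, assuming each saved question
- remembers which document identifiers it has already seen. The class and
- function names are hypothetical, and the scheduling itself (contacting
- servers at set times) is left to whatever timer the workstation provides.
-

```python
class SavedQuestion:
    """A stored query plus the bookkeeping needed to spot new documents."""

    def __init__(self, text, source_names):
        self.text = text
        self.source_names = set(source_names)
        self.seen = set()          # ids of every document returned so far
        self.new_documents = []    # ids not yet looked at by the user


def update_questions(source_name, search, questions):
    """Called when a source is contacted: refresh each question that uses it.

    search: function taking the question text and returning document ids.
    """
    for question in questions:
        if source_name not in question.source_names:
            continue  # this question does not reference the contacted source
        for doc_id in search(question.text):
            if doc_id not in question.seen:
                question.seen.add(doc_id)
                question.new_documents.append(doc_id)


mergers = SavedQuestion("bank mergers", ["news"])
weather = SavedQuestion("weather in Boston", ["almanac"])

# First nightly run: both returned documents are new to the mergers question.
update_questions("news", lambda text: ["d1", "d2"], [mergers, weather])
print(mergers.new_documents, weather.new_documents)
```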
-
- To make the ideal electronic personal newspaper, a system designer would
- need certain technologies which are not available today. Most computer
- screens are too small to allow efficient browsing of large amounts of text.
- Additionally, current data transmission speeds do not allow fast enough
- scanning if the text is not resident on the user's machine.
-
- Despite current limitations, the WAIS system employs a number of features
- which will be found in the personal newspaper of the future:
-
- * Clear displays of which questions have new documents.
- * Searches performed at night to hide communications delays.
- * Documents stored on disk for future reference.
- * Tools provided to quickly view stored documents.
-
- With these techniques, we have established a foundation of user support and
- acceptance.
-
-
- Servers
-
- The WAIS system was designed to be used by those who wish to sell
- information, as well as those who want to buy it. It provides a
- straightforward mechanism for indexing large amounts of data, making it
- available, and advertising the availability.
-
- The system is flexible enough to provide for a variety of billing methods.
- A small database maintainer might make the information available through a
- telephone connection. Using a 900 number, the billing would be taken care
- of by the phone company. A slightly more sophisticated site might have a
- password and credit card billing system. High volume servers might want to
- set up flat fee contracts with customers. Other methods will certainly
- emerge as use increases. The system was designed to be as adaptable as
- possible to future financial arrangements.
-
- As the dissemination of information becomes easier, questions of ownership,
- copyright, and theft of data must be addressed. These issues confront the
- entire information processing field, and are particularly acute here. The
- WAIS system is designed to keep control of the data in the hands of the
- servers. A server can choose to whom and when the data should be given.
- Documents are distributed with an explicit copyright disposition in their
- internal format. This is not to say that theft cannot occur, but if a
- client starts to resell another's data, standard copyright laws can be
- invoked.
-
-
- The Directory of Servers
-
- As the WAIS system develops, sources of information will proliferate,
- making it impossible for any user to keep track of all servers that may be
- available at any one time. To help solve this problem, Thinking Machines
- is maintaining a Directory of Servers in a universally accessible location.
- The Directory of Servers contains indexed textual descriptions of all known
- servers. It is queried just like any other source. Instead of text
- documents, however, it returns source structures, specially formatted files
- which can be plugged into a question and used for queries.
-
- For example, suppose you needed information concerning the current gross
- national product of Mali, but had no idea where to find it. You might
- first ask the directory of servers for "information about the current
- economic condition of Mali." The directory would return several
- documents; among them might be a source for the World Factbook, an on-line
- almanac maintained by the CIA. You would then use this document as the
- source field of a question, and re-run the query. This time, the system
- would contact the almanac, ask for the information, and return a document
- with the data you need.
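-
- The two-step lookup in this example can be sketched as follows. The
- directory entries, their field names, and the example.com-style addresses
- are all hypothetical, and matching on a single shared word is a deliberate
- simplification of whatever indexing the real directory performs.
-

```python
import re

# Hypothetical directory: each entry is a source structure with a
# textual description that the directory indexes.
DIRECTORY = [
    {"name": "world-factbook", "address": "factbook.example",
     "description": "almanac with economic and population data for every country"},
    {"name": "patent-abstracts", "address": "patents.example",
     "description": "abstracts for recently issued patents"},
]


def words(text):
    """Lower-case a string and return the set of words it contains."""
    return set(re.findall(r"[a-z]+", text.lower()))


def find_sources(question):
    """Step 1: ask the directory; it returns source structures, not text."""
    wanted = words(question)
    return [entry for entry in DIRECTORY if wanted & words(entry["description"])]


def build_question(question, sources):
    """Step 2: plug the returned sources into a new question to be run."""
    return {"question": question, "sources": [s["address"] for s in sources]}


found = find_sources("information about the current economic condition of Mali")
print(build_question("gross national product of Mali", found))
```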
-
- Additionally, the Directory of Servers provides a means for information
- providers to advertise the availability of their data. When a new source
- becomes available, the developers can submit a textual description, along
- with the necessary information for contacting the server. This information
- is added to the directory, and becomes available to the public.
-
-
- A Common Protocol for Information Retrieval
-
- One of the most far reaching aspects of this project is the development of
- an open protocol. The four companies have jointly specified a standard
- protocol for information retrieval. Creating a market where new servers
- can be readily established requires an open, publicly available protocol.
- Ideally this protocol would be internationally standardized, yet
- flexible enough to adapt to new ideas and technologies; functioning over
- any electronic network, from the highest speed optical connections to phone
- lines.
-
- The use of an open and versatile protocol fosters hardware independence.
- This not only provides for a much wider base of users, it allows the system
- to seamlessly evolve over time as hardware technology progresses. It
- provides incentive to produce the best components possible. For example,
- the protocol provides for the transmission of audio and video as well as
- text, even though at present most workstations are unable to handle them.
- However, they are free to ignore pictures and sound returned in response to
- a question, and to display and retrieve only text. This inability, though,
- does not hinder higher-end platforms from exploiting their greater
- processing power and network bandwidth.
-
- The WAIS protocol is an extension of the existing Z39.50 standard from
- NISO (3). It has been augmented where necessary to incorporate many of the
- needs of a full-text information retrieval system (4). To allow future
- flexibility, the standard does not restrict the query language or the data
- format of the information to be retrieved. Nonetheless, a query convention
- has been established for the existing servers and clients. The resulting
- WAIS Protocol is general enough to be implemented on a variety of
- communications systems.
-
- The success of a WAIS-like system depends on a critical mass of users and
- information services. In order to encourage development and use, Thinking
- Machines is not only publishing a specification for the protocol, but is
- also making the source code for a WAIS Protocol implementation freely
- available. While this software is available at no cost, it comes with no
- support. We hope that it will facilitate others in developing servers and
- clients.
-
-
- Future
-
- In developing the WAIS system, the participating companies have
- demonstrated that current hardware technology can be effectively used to
- provide sophisticated information retrieval services to novice end-users.
- How this might affect information providers is not yet completely
- understood. The users at Peat Marwick found the technology useful for
- day-to-day tasks such as researching potential new accounts and finding
- resources within their own organization. Since these tasks are not
- restricted to the accounting and management consulting industries, we are
- optimistic that this type of technology can be fruitful and productive in
- many corporate settings.
-
- The future of this system, and others like it, depends upon finding
- appropriate niches in the electronic publishing domain. Potential uses
- include making current online services more easily accessible to end-users;
- or allowing large corporations to access their own internal word processor
- files more efficiently. It is also possible that near-term development
- will focus on a single professional field such as patent law or medical
- research.
-
-
- Summary
-
- A unique alliance of four companies with complementary interests in the
- field of information retrieval has jointly developed a prototype which
- gives versatile access to full-text documents. The system allows users to
- retrieve personal, corporate, and wide area information through one
- easy-to-use interface. The WAIS project has shown that current
- technologies can be used to make useful, profitable, and convenient wide
- area information systems. The success of the project has convinced us that
- a WAIS-like system can be a valuable tool for corporate information
- retrieval.
-
-
- Acknowledgements
-
- The design and development of the WAIS Project has been a collective
- effort, with contributions and ideas coming from many people. Among them:
-
- Apple Computer: Charlie Bedard, David Casseras, Steve Cisler, Tom Erickson,
- Ruth Ridder, Eric Roth, John Thompson-Rohrlich, Kevin Tiene, Gitta Soloman,
- Oliver Steele, Janet Vratny-Watts.
-
- Dow Jones News/Retrieval: Clare Hart, Rod Wang, Roland Laird.
-
- Thinking Machines: Dan Aronson, Franklin Davis, Jonathan Goldman, Chris
- Madsen, Harry Morris, Patrick Bray, Danny Hillis, Gary Rancourt, Tracy
- Shen, Craig Stanfill, Steve Swartz, Ephraim Vishniac, David Waltz.
-
- KPMG Peat Marwick: Chris Arbogast, Mark Malone, Tom McDonough, Robin
- Palmer.
-
- Scolex Information Systems: Art Medlar.
-
- Thanks also to Advanced Software Concepts for TCPack software.
-
- ----------------
- Footnotes
-
- 1 Salton, Gerald; McGill, Michael. Introduction to Modern Information
- Retrieval. McGraw-Hill, 1983.
-
- 2 DowQuest promotional literature available from Dow Jones & Co. Inc., 200
- Liberty Street, New York, NY 10281.
-
- 3 Z39.50-1988: Information Retrieval Service Definition and Protocol
- Specification for Library Applications. National Information Standards
- Organization (Z39), P.O. Box 1056, Bethesda, MD 20817. (301) 975-2814.
- Available from Document Center, Belmont, CA. Telephone 415-591-7600.
-
- 4 Franklin Davis et al. WAIS Interface Protocol Prototype Functional
- Specification, Thinking Machines. Available from Franklin Davis
- (fad@think.com) or Brewster Kahle (brewster@think.com).
-